NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

FP-SMR: A Fully Digital Floating-Point Processing-in-SAS-MRAM for Session-based Recommender System

https://doi.org/10.1145/3716368.3735206

Ali, Asmer Hamid; Sridharan, Amitesh; Guo, Cheng; Hwang, William; Tsai, Wilman; Zhang, Jeff; Chen, Yiran; X_Wang, Shan; Fan, Deliang (June 2025, ACM)

With the rapid advancement of DNNs, numerous Process-in-Memory (PIM) architectures based on various memory technologies (Non-Volatile (NVM)/Volatile Memory) have been developed to accelerate AI workloads. Magnetic Random Access Memory (MRAM) is highly promising among NVMs due to its zero standby leakage, fast write/read speeds, CMOS compatibility, and high memory density. However, existing MRAM technologies such as spin-transfer torque MRAM (STT-MRAM) and spin-orbit torque MRAM (SOT-MRAM), have inherent limitations. STT-MRAM faces high write current requirements, while SOT-MRAM introduces significant area overhead due to additional access transistors. The new STT-assisted-SOT (SAS) MRAM provides an area-efficient alternative by sharing one write access transistor for multiple magnetic tunnel junctions (MTJs). This work presents the first fully digital processing-in-SAS-MRAM system to enable 8-bit floating-point (FP8) neural network inference with an application in on-device session-based recommender system. A SAS-MRAM device prototype is fabricated with 4 MTJs sharing the same SOT metal line. The proposed SAS-MRAM-based PIM macro is designed in TSMC 28nm technology. It achieves 15.31 TOPS/W energy efficiency and 269 GOPS performance for FP8 operations at 700 MHz. Compared to state-of-the-art recommender systems for the same popular YooChoose dataset, it demonstrates a 86 ×, 1.8 ×, and 1.12 × higher energy efficiency than that of GPU, SRAM-PIM, and ReRAM-PIM, respectively.
more » « less
Free, publicly-accessible full text available June 29, 2026
High-Density STT-Assisted SOT-MRAM (SAS-MRAM) for Energy-Efficient AI Applications

https://doi.org/10.1109/TMAG.2024.3486616

Xue, Fen; Hwang, William; Zhang, Fan; Tsai, Wilman; Fan, Deliang; Wang, Shan X (April 2025, IEEE Transactions on Magnetics)

Free, publicly-accessible full text available April 1, 2026
Thermal Characterization of Ultrathin MgO Tunnel Barriers

https://doi.org/10.1021/acs.nanolett.4c02571

Su, Haotian; Kwon, Heungdong; Xue, Fen; Sato, Noriyuki; Bhat, Usha; Tsai, Wilman; Bosman, Michel; Asheghi, Mehdi; Goodson, Kenneth E; Pop, Eric; et al (November 2024, Nano Letters)

Full Text Available
Efficient Memory Integration: MRAM-SRAM Hybrid Accelerator for Sparse On-Device Learning

https://doi.org/10.1145/3649329.3657390

Zhang, Fan; Sridharan, Amitesh; Tsai, Wilman; Chen, Yiran; Wang, Shan X; Fan, Deliang (June 2024, ACM)

With the prosperous development of Deep Neural Network (DNNs), numerous Process-In-Memory (PIM) designs have emerged to accelerate DNN models with exceptional throughput and energy-efficiency. PIM accelerators based on Non-Volatile Memory (NVM) or volatile memory offer distinct advantages for computational efficiency and performance. NVM based PIM accelerators, demonstrated success in DNN inference, face limitations in on-device learning due to high write energy, latency, and instability. Conversely, fast volatile memories, like SRAM, offer rapid read/write operations for DNN training, but suffer from significant leakage currents and large memory footprints. In this paper, for the first time, we present a fully-digital sparse processing in hybrid NVM-SRAM design, synergistically combines the strengths of NVM and SRAM, tailored for on-device continual learning. Our designed NVM and SRAM based PIM circuit macros could support both storage and processing of N:M structured sparsity pattern, significantly improving the storage and computing efficiency. Exhaustive experiments demonstrate that our hybrid system effectively reduces area and power consumption while maintaining high accuracy, offering a scalable and versatile solution for on-device continual learning.
more » « less
Full Text Available
Thermal optimization of two-terminal SOT-MRAM

https://doi.org/10.1063/5.0211620

Su, Haotian; Kwon, Heungdong; Hwang, William; Xue, Fen; Köroğlu, Çağıl; Tsai, Wilman; Asheghi, Mehdi; Goodson, Kenneth E; Wang, Shan X; Pop, Eric (July 2024, Journal of Applied Physics)

While magnetoresistive random-access memory (MRAM) stands out as a leading candidate for embedded nonvolatile memory and last-level cache applications, its endurance is compromised by substantial self-heating due to the high programming current density. The effect of self-heating on the endurance of the magnetic tunnel junction (MTJ) has primarily been studied in spin-transfer torque (STT)-MRAM. Here, we analyze the transient temperature response of two-terminal spin–orbit torque (SOT)-MRAM with a 1 ns switching current pulse using electro-thermal simulations. We estimate a peak temperature range of 350–450 °C in 40 nm diameter MTJs, underscoring the critical need for thermal management to improve endurance. We suggest several thermal engineering strategies to reduce the peak temperature by up to 120 °C in such devices, which could improve their endurance by at least a factor of 1000× at 0.75 V operating voltage. These results suggest that two-terminal SOT-MRAM could significantly outperform conventional STT-MRAM in terms of endurance, substantially benefiting from thermal engineering. These insights are pivotal for thermal optimization strategies in the development of MRAM technologies.
more » « less
Full Text Available
On-Device Continual Learning With STT-Assisted-SOT MRAM Based In-Memory Computing

https://doi.org/10.1109/TCAD.2024.3371268

Zhang, Fan; Sridharan, Amitesh; Hwang, William; Xue, Fen; Tsai, Wilman; Wang, Shan Xiang; Fan, Deliang (February 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Due to the separate memory and computation units in traditional Von-Neumann architecture, massive data transfer dominates the overall computing system’s power and latency, known as the ‘Memory-Wall’ issue. Especially with ever-increasing deep learning-based AI model size and computing complexity, it becomes the bottleneck for state-of-the-art AI computing systems. To address this challenge, In-Memory Computing (IMC) based Neural Network accelerators have been widely investigated to support AI computing within memory. However, most of those works focus only on inference. The on-device training and continual learning have not been well explored yet. In this work, for the first time, we introduce on-device continual learning with STT-assisted-SOT (SAS) Magnetic Random Access Memory (MRAM) based IMC system. On the hardware side, we have fabricated a SAS-MRAM device prototype with 4 Magnetic Tunnel Junctions (MTJ, each at 100nm × 50nm) sharing a common heavy metal layer, achieving significantly improved memory writing and area efficiency compared to traditional SOT-MRAM. Next, we designed fully digital IMC circuits with our SAS-MRAM to support both neural network inference and on-device learning. To enable efficient on-device continual learning for new task data, we present an 8-bit integer (INT8) based continual learning algorithm that utilizes our SAS-MRAM IMC-supported bit-serial digital in-memory convolution operations to train a small parallel reprogramming Network (Rep-Net) while freezing the major backbone model. Extensive studies have been presented based on our fabricated SAS-MRAM device prototype, cross-layer device-circuit benchmarking and simulation, as well as the on-device continual learning system evaluation.
more » « less
Full Text Available
Experimental Demonstration of Field-Free STT-Assisted SOT-MRAM (SAS-MRAM) with Four Bits per SOT Programming Line

https://doi.org/10.1109/LED.2024.3437352

Hwang, William; Xue, Fen; Song, Ming-Yuan; Hsu, Chen-Feng; Chen, T C; Tsai, Wilman; Bao, Xinyu; Wang, Shan X (January 2024, IEEE Electron Device Letters)

Full Text Available
Observation of anti-damping spin–orbit torques generated by in-plane and out-of-plane spin polarizations in MnPd3

https://doi.org/10.1038/s41563-023-01522-3

DC, Mahendra; Shao, Ding-Fu; Hou, Vincent D.-H.; Vailionis, Arturas; Quarterman, P.; Habiboglu, Ali; Venuti, M. B.; Xue, Fen; Huang, Yen-Lin; Lee, Chien-Min; et al (May 2023, Nature Materials)

Full Text Available
Accelerating Deep Neural Networks in Processing-in-Memory Platforms: Analog or Digital Approach?

https://doi.org/10.1109/ISVLSI.2019.00044

Angizi, Shaahin; He, Zhezhi; Reis, Dayane; Hu, Xiaobo Sharon; Tsai, Wilman; Lin, Shy Jay; Fan, Deliang (July 2019, 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI))

Nowadays, research topics on AI accelerator designs have attracted great interest, where accelerating Deep Neural Network (DNN) using Processing-in-Memory (PIM) platforms is an actively-explored direction with great potential. PIM platforms, which simultaneously aims to address power- and memory-wall bottlenecks, have shown orders of performance enhancement in comparison to the conventional computing platforms with Von-Neumann architecture. As one direction of accelerating DNN in PIM, resistive memory array (aka. crossbar) has drawn great research interest owing to its analog current-mode weighted summation operation which intrinsically matches the dominant Multiplication-and-Accumulation (MAC) operation in DNN, making it one of the most promising candidates. An alternative direction for PIM-based DNN acceleration is through bulk bit-wise logic operations directly performed on the content in digital memories. Thanks to the high fault-tolerant characteristic of DNN, the latest algorithmic progression successfully quantized DNN parameters to low bit-width representations, while maintaining competitive accuracy levels. Such DNN quantization techniques essentially convert MAC operation to much simpler addition/subtraction or comparison operations, which can be performed by bulk bit-wise logic operations in a highly parallel fashion. In this paper, we build a comprehensive evaluation framework to quantitatively compare and analyze aforementioned PIM based analog and digital approaches for DNN acceleration.
more » « less
Full Text Available

Search for: All records